25 - Delhi PM2.5 (Normal distribution)

Python
Distributions
Published

August 13, 2025

1 Goal

Today I chose a dataset from Kaggle with data on the air quality of Delhi. I want to try to fit a normal distribution to some of the data.

import pandas as pd
import numpy as np
df = pd.read_csv('data/day25/delhi_air_quality.csv')
df.head(5)
   Date  Month  Year  Holidays_Count  Days   PM2.5    PM10     NO2    SO2    CO  Ozone  AQI
0     1      1  2021               0     5  408.80  442.42  160.61  12.95  2.77  43.19  462
1     2      1  2021               0     6  404.04  561.95   52.85   5.18  2.60  16.43  482
2     3      1  2021               1     7  225.07  239.04  170.95  10.93  1.40  44.29  263
3     4      1  2021               0     1   89.55  132.08  153.98  10.42  1.01  49.19  207
4     5      1  2021               0     2   54.06   55.54  122.66   9.70  0.64  48.88  149
# Get an understanding of the data through summary statistics
df.describe()
              Date        Month         Year  Holidays_Count         Days        PM2.5         PM10          NO2          SO2           CO        Ozone          AQI
count  1461.000000  1461.000000  1461.000000     1461.000000  1461.000000  1461.000000  1461.000000  1461.000000  1461.000000  1461.000000  1461.000000  1461.000000
mean     15.729637     6.522930  2022.501027        0.189596     4.000684    90.774538   218.219261    37.184921    20.104921     1.025832    36.338871   202.210815
std       8.803105     3.449884     1.118723        0.392116     2.001883    71.650579   129.297734    35.225327    16.543659     0.608305    18.951204   107.801076
min       1.000000     1.000000  2021.000000        0.000000     1.000000     0.050000     9.690000     2.160000     1.210000     0.270000     2.700000    19.000000
25%       8.000000     4.000000  2022.000000        0.000000     2.000000    41.280000   115.110000    17.280000     7.710000     0.610000    24.100000   108.000000
50%      16.000000     7.000000  2023.000000        0.000000     4.000000    72.060000   199.800000    30.490000    15.430000     0.850000    32.470000   189.000000
75%      23.000000    10.000000  2024.000000        0.000000     6.000000   118.500000   297.750000    45.010000    26.620000     1.240000    45.730000   284.000000
max      31.000000    12.000000  2024.000000        1.000000     7.000000  1000.000000  1000.000000   433.980000   113.400000     4.700000   115.870000   500.000000

The summary shows four years of data (2021 to 2024) with one recording per day: 1461 rows, which matches the number of days in those four years. It would now be interesting to plot the PM2.5 column.
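The row count can be checked against the calendar. The sketch below assumes that the Date, Month and Year columns hold the day, month and year of each reading, as the first rows suggest. It rebuilds a datetime column from the five rows shown by df.head(5) and counts the days from 2021 through 2024. The names `sample`, `parts`, `Timestamp` and `expected_days` are introduced here for illustration only.

```python
import pandas as pd

# First five rows from df.head(5), limited to the calendar columns and PM2.5
sample = pd.DataFrame({
    "Date": [1, 2, 3, 4, 5],
    "Month": [1, 1, 1, 1, 1],
    "Year": [2021, 2021, 2021, 2021, 2021],
    "PM2.5": [408.80, 404.04, 225.07, 89.55, 54.06],
})

# pd.to_datetime can assemble dates from columns named year, month and day
parts = sample[["Year", "Month", "Date"]].rename(
    columns={"Year": "year", "Month": "month", "Date": "day"}
)
sample["Timestamp"] = pd.to_datetime(parts)

# Number of calendar days from 2021-01-01 to 2024-12-31 (2024 is a leap year)
expected_days = len(pd.date_range("2021-01-01", "2024-12-31", freq="D"))
print(expected_days)
```

On the full dataframe, the same Timestamp column would also make it possible to plot PM2.5 over time rather than by month number.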

df.columns
Index(['Date', 'Month', 'Year', 'Holidays_Count', 'Days', 'PM2.5', 'PM10',
       'NO2', 'SO2', 'CO', 'Ozone', 'AQI'],
      dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1461 entries, 0 to 1460
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            1461 non-null   int64  
 1   Month           1461 non-null   int64  
 2   Year            1461 non-null   int64  
 3   Holidays_Count  1461 non-null   int64  
 4   Days            1461 non-null   int64  
 5   PM2.5           1461 non-null   float64
 6   PM10            1461 non-null   float64
 7   NO2             1461 non-null   float64
 8   SO2             1461 non-null   float64
 9   CO              1461 non-null   float64
 10  Ozone           1461 non-null   float64
 11  AQI             1461 non-null   int64  
dtypes: float64(6), int64(6)
memory usage: 137.1 KB
import altair as alt

alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2.5'
)

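The chart would not render the PM2.5 column. The likely cause is the period in the column name: Altair hands field names to Vega-Lite, which reads a period as access to a nested field, so a column called PM2.5 is not found. Renaming the column avoids the problem.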

df = df.rename(columns={'PM2.5': 'PM2_5'})
# Trying again with the new column name
alt.Chart(df).mark_point().encode(
    x='Month',
    y='PM2_5'
)

That did the trick. The plot shows that PM2.5 particle levels are generally lowest from July to September, while December and January are the worst months. There is, however, an outlier in June with a PM2.5 of 1000. Maybe the instrument could not read above that threshold; the PM10 column also tops out at exactly 1000.

2 Calculating the normal distribution of PM2.5 for 2024

import math
import matplotlib.pyplot as plt

# Keep only the rows from 2024
df_2024 = df[df['Year'] == 2024]

def normal_pdf(x, mu=0, sigma=1):
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt_two_pi * sigma)

# Remove extreme readings by dropping the top 1% of PM2_5 values.
# This happens before estimating mu and sigma, so outliers do not inflate them.
df_2024 = df_2024[df_2024['PM2_5'] < df_2024['PM2_5'].quantile(0.99)]

# Storing the mean value of PM2.5 in 2024
mu = df_2024['PM2_5'].mean()

# Storing the (sample) standard deviation of PM2.5 in 2024
sigma = df_2024['PM2_5'].std()

# Creating an array of evenly spaced values (step 1) between min and max.
# The PM2_5 column can't be used as-is: it is unsorted and has gaps between min and max.
xs = np.arange(df_2024['PM2_5'].min(), df_2024['PM2_5'].max())

# Storing the density of the fitted normal distribution for each x
y = [normal_pdf(x, mu=mu, sigma=sigma) for x in xs]

# Plotting the distribution
plt.plot(xs, y)
plt.title("Normal distribution of PM2.5 in Delhi 2024")
plt.xlabel("PM2.5")
plt.ylabel("Probability density")
plt.show()
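A quick way to sanity-check a hand-written density function is to confirm that the area under it is close to 1. The sketch below uses a vectorized NumPy variant of the same formula (named `normal_pdf_np` here, which is not part of the original code) and a simple Riemann sum. Since the 2024 mean and standard deviation are not printed above, it uses the all-years PM2.5 values from df.describe() purely as illustrative parameters.

```python
import numpy as np

def normal_pdf_np(x, mu=0.0, sigma=1.0):
    """Vectorized normal density; x may be a scalar or a NumPy array."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative parameters: the all-years PM2.5 mean and std from df.describe()
mu_all, sigma_all = 90.774538, 71.650579

# Fine grid covering six standard deviations on each side of the mean
grid = np.linspace(mu_all - 6 * sigma_all, mu_all + 6 * sigma_all, 10_001)
density = normal_pdf_np(grid, mu=mu_all, sigma=sigma_all)

# Riemann-sum approximation of the area under the curve (should be close to 1)
dx = grid[1] - grid[0]
area = float(np.sum(density) * dx)
print(round(area, 4))
```

Because the function accepts arrays, it can also replace the explicit loop over xs when computing the y values for the plot.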

3 Reflections

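We thus have a probability density function for PM2.5. Because PM2.5 is a continuous quantity, the curve does not give the probability of one exact value; the probability that PM2.5 falls within a range is the area under the curve over that range.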
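To get an actual probability out of a normal model, the cumulative distribution function is the usual tool. The sketch below computes it with the standard library's math.erf. The helper names `normal_cdf` and `prob_between` are introduced here for illustration, and the parameters are the all-years PM2.5 mean and standard deviation from df.describe(), not the 2024 values fitted above.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal variable with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(low, high, mu=0.0, sigma=1.0):
    """P(low <= X <= high) as a difference of two CDF values."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)

# Illustrative parameters: the all-years PM2.5 mean and std from df.describe()
mu_all, sigma_all = 90.774538, 71.650579

# Probability the normal model assigns to PM2.5 between 50 and 100
p_50_100 = prob_between(50, 100, mu_all, sigma_all)

# Probability the normal model assigns to impossible negative readings
p_negative = normal_cdf(0, mu_all, sigma_all)

print(f"P(50 <= PM2.5 <= 100) = {p_50_100:.3f}")
print(f"P(PM2.5 < 0) = {p_negative:.3f}")
```

With these parameters the model puts roughly a tenth of its probability below zero, which PM2.5 can never be. That is a hint that a normal curve is only a rough description of this column, whose mean (90.77) sits well above its median (72.06).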
Besides fitting a normal distribution, it could be interesting to use linear regression to approximate the PM2.5 on any given day.